Use dimensionality reduction technique – PCA and train a model using principle components instead of training the model using just the raw data.

Contest:-

The data contains features extracted from the silhouette of vehicles in different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, Cheverolet van, Saab 9000 and an Opel Manta 400 cars. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars. The purpose is to classify a given silhouette as one of three types of vehicle(car,bus,van), using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information:-

  • compactness
  • circularity
  • distance_circularity
  • radius_ratio
  • pr.axis_aspect_ratio
  • max.length_aspect_ratio
  • scatter_ratio elongatedness
  • pr.axis_rectangularity
  • max.length_rectangularity
  • scaled_variance
  • scaled_variance.1
  • scaled_radius_of_gyration
  • scaled_radius_of_gyration.1
  • skewness_about
  • skewness_about.1
  • skewness_about.2
  • hollows_ratio
  • class - Bus , Car , Van

Import all neccessary modules and load the data

In [2]:
#Import all necessary modules and load the data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import recall_score,precision_score,confusion_matrix,classification_report
In [3]:
df=pd.read_csv('vehicle-1.csv')
In [4]:
df.head(10)
Out[4]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
5 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
6 97 43.0 73.0 173.0 65.0 6 153.0 42.0 19.0 143 176.0 361.0 172.0 66.0 13.0 1.0 200.0 204 bus
7 90 43.0 66.0 157.0 65.0 9 137.0 48.0 18.0 146 162.0 281.0 164.0 67.0 3.0 3.0 193.0 202 van
8 86 34.0 62.0 140.0 61.0 7 122.0 54.0 17.0 127 141.0 223.0 112.0 64.0 2.0 14.0 200.0 208 van
9 93 44.0 98.0 NaN 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car
In [5]:
df.tail(10)
Out[5]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
836 87 45.0 66.0 139.0 58.0 8 140.0 47.0 18.0 148 168.0 294.0 175.0 73.0 3.0 12.0 188.0 196 van
837 94 46.0 77.0 169.0 60.0 8 158.0 42.0 20.0 148 181.0 373.0 181.0 67.0 12.0 2.0 193.0 199 car
838 95 43.0 76.0 142.0 57.0 10 151.0 44.0 19.0 149 173.0 339.0 159.0 71.0 2.0 23.0 187.0 200 van
839 90 44.0 72.0 157.0 64.0 8 137.0 48.0 18.0 144 159.0 283.0 171.0 65.0 9.0 4.0 196.0 203 van
840 93 34.0 66.0 140.0 56.0 7 130.0 51.0 18.0 120 151.0 251.0 114.0 62.0 5.0 29.0 201.0 207 car
841 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 car
842 89 46.0 84.0 163.0 66.0 11 159.0 43.0 20.0 159 173.0 368.0 176.0 72.0 1.0 20.0 186.0 197 van
843 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 car
844 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 car
845 85 36.0 66.0 123.0 55.0 5 120.0 56.0 17.0 128 140.0 212.0 131.0 73.0 1.0 18.0 186.0 190 van
In [6]:
df.dtypes
Out[6]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                           object
dtype: object
In [7]:
df.shape
Out[7]:
(846, 19)
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
Original dataset has 846 rows and 19 cloumns(attributes)

To check whether dataset contains null values

In [9]:
print(df.isna().sum().sum())
print(df.isnull().sum().sum())
41
41
In [10]:
df=df.fillna(df.mean())

Replacing Null values with mean values

In [11]:
print(df.isna().sum().sum())
print(df.isnull().sum().sum())
0
0
In [12]:
df.describe().transpose()
Out[12]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.828775 6.133943 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.110451 15.740902 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.888095 33.400979 104.0 141.00 168.0 195.00 333.0
pr.axis_aspect_ratio 846.0 61.678910 7.882119 47.0 57.00 61.0 65.00 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.00 55.0
scatter_ratio 846.0 168.901775 33.195188 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.933728 7.811559 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.582444 2.588326 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.631079 31.355195 130.0 167.00 179.0 217.00 320.0
scaled_variance.1 846.0 439.494076 176.457706 184.0 318.25 364.0 586.75 1018.0
scaled_radius_of_gyration 846.0 174.709716 32.546223 109.0 149.00 174.0 198.00 268.0
scaled_radius_of_gyration.1 846.0 72.447743 7.468450 59.0 67.00 72.0 75.00 135.0
skewness_about 846.0 6.364286 4.903148 0.0 2.00 6.0 9.00 22.0
skewness_about.1 846.0 12.602367 8.930792 0.0 5.00 11.0 19.00 41.0
skewness_about.2 846.0 188.919527 6.152166 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0
In [13]:
df['class'].value_counts()
Out[13]:
car    429
bus    218
van    199
Name: class, dtype: int64

There are 429 cars,218 bus and 199 van

In [14]:
plt.figure(figsize= (30,50))  # Set the figure size
pos = 1    # a variable to manage the position of the subplot in the overall plot
for feature in df.columns:   # for-loop to iterate over every attribute whose distribution is to be visualized
    plt.subplot(8, 3, pos)   # plot grid
    if feature not in ['class']:   # Plot histogram for all the continuous columns
         sns.distplot(df[feature], kde= True )   
    else:
        sns.countplot(df[feature], palette= 'jet_r')    # Plot bar chart for all the categorical columns
    pos += 1  # to plot over the grid one by one  
  • Compactness,pr.axis_rectangularity,scaled_radius_of_gyration.1,skewness_about.2,radius_ratio,pr.axis_aspect_ratio and scaled_radius_of_gyration are seems to have normal distribution.
  • Skewness_about,skewness_about.1 tends to have left skewness.
  • Circularity,distance_circularity,max.length_aspect_ratio,pr.axis_aspect_ratio,skewness_about.2 and scaled_variance have high number of outliers.
In [15]:
#Boxplot to view outliers of Circularity,distance_circularity,max.length_aspect_ratio,pr.axis_aspect_ratio,
#skewness_about.2 and scaled_variance.
sns.set(style="darkgrid")
#Set up the matplotlib figure
f, axes = plt.subplots(2,3, figsize=(15,12))
sns.despine(left=True)
sns.boxplot(df.circularity, color="r", ax=axes[0,0],)
sns.boxplot(df['max.length_aspect_ratio'], color="g", ax=axes[0,1])
sns.boxplot(df['pr.axis_aspect_ratio'], color="b", ax=axes[0,2])
sns.boxplot(df.scaled_variance, color="r", ax=axes[1,0])
sns.boxplot(df.distance_circularity, color="g", ax=axes[1,1])
axes[1,2].remove();
In [16]:
sns.pairplot(df, hue = 'class', diag_kind='kde',height=3)    
plt.show()
In [17]:
sns.set(style="white")
# Compute the correlation matrix
corr = df.corr()



# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 18))


# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, annot=True,cmap='YlGnBu',  vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x188a86f09c8>

Observation:-

  1. Elongatedness has high correlation with scatter_ratio(95%) and Scaled_variance.1(93%)
  2. We can drop scatter_ratio(95%) and Scaled_variance.1 varialble as these have higher corelation with Elongatedness.
  3. Class variable has 3 categorigal variable , so we have to convert that in one hot encoding before feeding to our model.
In [18]:
adf=df.drop(['scatter_ratio','scaled_variance.1'],axis=1)
adf.head(10)
Out[18]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.000000 83.0 178.000000 72.0 10 42.0 20.0 159 176.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.000000 84.0 141.000000 57.0 9 45.0 19.0 143 170.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.000000 106.0 209.000000 66.0 10 32.0 23.0 158 223.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.000000 82.0 159.000000 63.0 9 46.0 19.0 143 160.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.000000 70.0 205.000000 103.0 52 45.0 19.0 144 241.0 188.0 127.0 9.0 11.0 180.0 183 bus
5 107 44.828775 106.0 172.000000 50.0 6 26.0 28.0 169 280.0 264.0 85.0 5.0 9.0 181.0 183 bus
6 97 43.000000 73.0 173.000000 65.0 6 42.0 19.0 143 176.0 172.0 66.0 13.0 1.0 200.0 204 bus
7 90 43.000000 66.0 157.000000 65.0 9 48.0 18.0 146 162.0 164.0 67.0 3.0 3.0 193.0 202 van
8 86 34.000000 62.0 140.000000 61.0 7 54.0 17.0 127 141.0 112.0 64.0 2.0 14.0 200.0 208 van
9 93 44.000000 98.0 168.888095 62.0 11 36.0 22.0 146 202.0 152.0 64.0 4.0 14.0 195.0 204 car
In [19]:
# Converting class variables to Label encoding
from sklearn.preprocessing import LabelEncoder

adf['class_label_encoded'] = LabelEncoder().fit_transform(adf['class'])
In [20]:
adf.head(10)
Out[20]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class class_label_encoded
0 95 48.000000 83.0 178.000000 72.0 10 42.0 20.0 159 176.0 184.0 70.0 6.0 16.0 187.0 197 van 2
1 91 41.000000 84.0 141.000000 57.0 9 45.0 19.0 143 170.0 158.0 72.0 9.0 14.0 189.0 199 van 2
2 104 50.000000 106.0 209.000000 66.0 10 32.0 23.0 158 223.0 220.0 73.0 14.0 9.0 188.0 196 car 1
3 93 41.000000 82.0 159.000000 63.0 9 46.0 19.0 143 160.0 127.0 63.0 6.0 10.0 199.0 207 van 2
4 85 44.000000 70.0 205.000000 103.0 52 45.0 19.0 144 241.0 188.0 127.0 9.0 11.0 180.0 183 bus 0
5 107 44.828775 106.0 172.000000 50.0 6 26.0 28.0 169 280.0 264.0 85.0 5.0 9.0 181.0 183 bus 0
6 97 43.000000 73.0 173.000000 65.0 6 42.0 19.0 143 176.0 172.0 66.0 13.0 1.0 200.0 204 bus 0
7 90 43.000000 66.0 157.000000 65.0 9 48.0 18.0 146 162.0 164.0 67.0 3.0 3.0 193.0 202 van 2
8 86 34.000000 62.0 140.000000 61.0 7 54.0 17.0 127 141.0 112.0 64.0 2.0 14.0 200.0 208 van 2
9 93 44.000000 98.0 168.888095 62.0 11 36.0 22.0 146 202.0 152.0 64.0 4.0 14.0 195.0 204 car 1

New label encoded values of 'class' variable are bus = 0 car = 1 van = 2

In [21]:
#drop class variable 
adf=adf.drop(['class'],axis=1)
adf
Out[21]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class_label_encoded
0 95 48.0 83.0 178.0 72.0 10 42.0 20.0 159 176.0 184.0 70.0 6.0 16.0 187.0 197 2
1 91 41.0 84.0 141.0 57.0 9 45.0 19.0 143 170.0 158.0 72.0 9.0 14.0 189.0 199 2
2 104 50.0 106.0 209.0 66.0 10 32.0 23.0 158 223.0 220.0 73.0 14.0 9.0 188.0 196 1
3 93 41.0 82.0 159.0 63.0 9 46.0 19.0 143 160.0 127.0 63.0 6.0 10.0 199.0 207 2
4 85 44.0 70.0 205.0 103.0 52 45.0 19.0 144 241.0 188.0 127.0 9.0 11.0 180.0 183 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
841 93 39.0 87.0 183.0 64.0 8 40.0 20.0 134 200.0 149.0 72.0 7.0 25.0 188.0 195 1
842 89 46.0 84.0 163.0 66.0 11 43.0 20.0 159 173.0 176.0 72.0 1.0 20.0 186.0 197 2
843 106 54.0 101.0 222.0 67.0 12 30.0 25.0 173 228.0 200.0 70.0 3.0 4.0 187.0 201 1
844 86 36.0 78.0 146.0 58.0 7 50.0 18.0 124 155.0 148.0 66.0 0.0 25.0 190.0 195 1
845 85 36.0 66.0 123.0 55.0 5 56.0 17.0 128 140.0 131.0 73.0 1.0 18.0 186.0 190 2

846 rows × 17 columns

Spliting the data into test and train

In [22]:
from sklearn.model_selection import train_test_split

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score, confusion_matrix

target = adf['class_label_encoded']
features = adf.drop(['class_label_encoded'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features,target, test_size = 0.2, random_state = 10)

Scaling the data

In [53]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.30,random_state=20)
sc=StandardScaler()
X_train_sd= sc.fit_transform(X_train)
X_test_sd = sc.transform(X_test)

Train a Support vector machine

In [42]:
from sklearn.svm import SVC

# Building a Support Vector Machine on train data with kernel = 'Linear'
svc_model = SVC(C= .1, kernel='linear', gamma= 1)
svc_model.fit(X_train_sd, y_train)

prediction = svc_model.predict(X_test_sd)
In [43]:
# check the accuracy on the training set
print(svc_model.score(X_train_sd, y_train))
print(svc_model.score(X_test_sd, y_test))
0.9341216216216216
0.9133858267716536
In [96]:
# Building a Support Vector Machine on train data with kernel = 'rbf'
%timeit pass
svc_model = SVC(kernel='rbf')
svc_model.fit(X_train_sd, y_train)

# check the accuracy on the training set
print("Train accuracy-  ",svc_model.score(X_train_sd, y_train))
print("Test accuracy-   ",svc_model.score(X_test_sd, y_test))
6.19 ns ± 0.0521 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
Train accuracy-   0.9763313609467456
Test accuracy-    0.9647058823529412
In [46]:
#Building a Support Vector Machine on train data(changing the kernel)
svc_model  = SVC(kernel='poly')
svc_model.fit(X_train_sd, y_train)

prediction = svc_model.predict(X_test_sd)

print(svc_model.score(X_train_sd, y_train))
print(svc_model.score(X_test_sd, y_test))
0.8277027027027027
0.7952755905511811
In [47]:
svc_model = SVC(kernel='sigmoid')
svc_model.fit(X_train_sd, y_train)

prediction = svc_model.predict(X_test_sd)

print(svc_model.score(X_train_sd, y_train))
print(svc_model.score(X_test_sd, y_test))
0.652027027027027
0.6614173228346457

Observation:- SVM gives train accuracy score of 0.97 and test accuracy score of 0.95 (when kernel='rbf').

Apply Principal component analysis on given data set

In [48]:
# generating the covariance matrix and the eigen values for the PCA analysis
cov_matrix = np.cov(X_train_sd.T) # the relevanat covariance matrix
print('Covariance Matrix \n%s', cov_matrix)

#generating the eigen values and the eigen vectors
e_vals, e_vecs = np.linalg.eig(cov_matrix)
print('Eigenvectors \n%s' %e_vecs)
print('\nEigenvalues \n%s' %e_vals)
Covariance Matrix 
%s [[ 1.00169205  0.69321119  0.79890124  0.69243059  0.09249239  0.13955742
  -0.79690271  0.82152003  0.68318293  0.76897355  0.59213989 -0.23117379
   0.22381684  0.15461367  0.30280123  0.36621899]
 [ 0.69321119  1.00169205  0.79317679  0.6201834   0.15597529  0.24311191
  -0.82368325  0.84674873  0.96535831  0.79616518  0.92855143  0.06239254
   0.14998312 -0.01362319 -0.10338834  0.04212554]
 [ 0.79890124  0.79317679  1.00169205  0.75413443  0.13652345  0.25135105
  -0.91737668  0.89854202  0.77531276  0.86393847  0.71178192 -0.20344237
   0.12645075  0.26862085  0.14067952  0.32447352]
 [ 0.69243059  0.6201834   0.75413443  1.00169205  0.66768624  0.47516204
  -0.7781773   0.69870191  0.5744136   0.79470635  0.5281735  -0.10895193
   0.05488438  0.16943603  0.36831224  0.45016774]
 [ 0.09249239  0.15597529  0.13652345  0.66768624  1.00169205  0.7004596
  -0.16430138  0.06935861  0.1379506   0.28382625  0.11623307  0.25463095
  -0.05096827 -0.0364788   0.20182782  0.22350611]
 [ 0.13955742  0.24311191  0.25135105  0.47516204  0.7004596   1.00169205
  -0.17040636  0.14938246  0.30029093  0.33060824  0.18201322  0.37051555
   0.01725412  0.04246609 -0.0412369   0.12435568]
 [-0.79690271 -0.82368325 -0.91737668 -0.7781773  -0.16430138 -0.17040636
   1.00169205 -0.94802439 -0.7819992  -0.93282594 -0.76872825  0.09045558
  -0.05752298 -0.19274942 -0.11593657 -0.21231643]
 [ 0.82152003  0.84674873  0.89854202  0.69870191  0.06935861  0.14938246
  -0.94802439  1.00169205  0.81522771  0.93171771  0.80492813 -0.00891996
   0.08170441  0.21632218 -0.02024176  0.0926891 ]
 [ 0.68318293  0.96535831  0.77531276  0.5744136   0.1379506   0.30029093
  -0.7819992   0.81522771  1.00169205  0.75104284  0.87470529  0.05182069
   0.14012108 -0.00762642 -0.09977622  0.07519065]
 [ 0.76897355  0.79616518  0.86393847  0.79470635  0.28382625  0.33060824
  -0.93282594  0.93171771  0.75104284  1.00169205  0.78362816  0.15099632
   0.03741146  0.19081629  0.009133    0.07571622]
 [ 0.59213989  0.92855143  0.71178192  0.5281735   0.11623307  0.18201322
  -0.76872825  0.80492813  0.87470529  0.78362816  1.00169205  0.2014156
   0.16585204 -0.0552343  -0.22692293 -0.12646916]
 [-0.23117379  0.06239254 -0.20344237 -0.10895193  0.25463095  0.37051555
   0.09045558 -0.00891996  0.05182069  0.15099632  0.2014156   1.00169205
  -0.11071042 -0.12787634 -0.72081254 -0.77431296]
 [ 0.22381684  0.14998312  0.12645075  0.05488438 -0.05096827  0.01725412
  -0.05752298  0.08170441  0.14012108  0.03741146  0.16585204 -0.11071042
   1.00169205 -0.04771326  0.13863382  0.13150541]
 [ 0.15461367 -0.01362319  0.26862085  0.16943603 -0.0364788   0.04246609
  -0.19274942  0.21632218 -0.00762642  0.19081629 -0.0552343  -0.12787634
  -0.04771326  1.00169205  0.06213147  0.20895682]
 [ 0.30280123 -0.10338834  0.14067952  0.36831224  0.20182782 -0.0412369
  -0.11593657 -0.02024176 -0.09977622  0.009133   -0.22692293 -0.72081254
   0.13863382  0.06213147  1.00169205  0.8865507 ]
 [ 0.36621899  0.04212554  0.32447352  0.45016774  0.22350611  0.12435568
  -0.21231643  0.0926891   0.07519065  0.07571622 -0.12646916 -0.77431296
   0.13150541  0.20895682  0.8865507   1.00169205]]
Eigenvectors 
[[ 3.08082465e-01  1.31467019e-01 -1.17666889e-01  3.85097806e-02
   4.12910559e-02  2.29223603e-01  4.93558923e-01 -5.21301983e-01
   4.45333504e-01 -2.41259515e-01  6.33679532e-02 -1.68631929e-02
  -1.92761262e-01  4.94663644e-02  4.59179360e-02  1.81192821e-02]
 [ 3.28755458e-01 -1.33082362e-01 -9.24085133e-02  1.53345459e-01
  -8.00128418e-02 -2.86990182e-01 -2.26816231e-01 -1.91047766e-01
   4.07537298e-02  1.26383645e-01 -2.64507734e-02 -1.49638000e-01
  -4.37988007e-02  3.20579478e-01 -4.84657181e-01 -5.33982691e-01]
 [ 3.38881813e-01  6.71430763e-02 -8.45674395e-02 -1.06697041e-01
   4.47216059e-02 -8.99924045e-02  1.24717047e-01  4.77220536e-01
   7.91677755e-02 -1.41435238e-01  6.46587195e-01 -3.20473997e-01
   1.49288894e-01 -6.51849899e-02  1.24468941e-01 -1.40156183e-01]
 [ 3.03849385e-01  1.47099732e-01  2.85865017e-01 -4.08694649e-02
  -6.11073860e-02  2.57792769e-01 -1.98322445e-01  1.29861507e-01
   2.47102552e-01 -4.01464050e-03 -1.53830488e-01 -9.56087495e-03
   1.14914755e-01 -5.55947909e-01 -5.00430593e-01  1.36805688e-01]
 [ 1.05328523e-01  3.57914918e-02  6.42857809e-01  7.03233764e-02
  -4.79695976e-02  1.64623798e-01 -4.09820248e-01 -6.81383051e-02
   2.92410502e-01  3.36307607e-02  1.22303630e-01  9.63879456e-03
  -1.79557944e-02  3.36218496e-01  3.87731980e-01 -4.50377582e-02]
 [ 1.22982708e-01 -7.60923538e-02  5.72673022e-01  4.24905335e-02
   1.90930024e-01 -4.34024762e-01  4.93027429e-01  1.48571457e-01
  -1.37107654e-01 -1.29115549e-01 -3.17511957e-01 -9.61128312e-02
  -7.72423237e-02 -4.11791093e-02  1.18037176e-02 -7.76002833e-02]
 [-3.45352517e-01 -1.07388683e-02  8.93701024e-02  1.09236416e-01
   7.72956554e-02 -1.79960158e-01  1.93804649e-02 -2.55451291e-01
   1.79458229e-01 -3.02960224e-01  1.07585194e-01 -7.52258267e-02
   7.66219724e-01  4.98414044e-02 -1.55780036e-01  4.04668029e-02]
 [ 3.42450470e-01 -5.85384206e-02 -1.46614063e-01 -1.22128705e-01
  -2.99591213e-03  1.45472473e-01  1.03993819e-01  7.85488686e-02
   6.32568654e-03  1.68262126e-01 -3.22188413e-01  3.77981411e-01
   4.91696278e-01 -7.72971674e-02  3.56110449e-01 -3.93893080e-01]
 [ 3.19237550e-01 -1.23599079e-01 -7.93535808e-02  1.55463972e-01
  -6.49233488e-02 -4.47587671e-01 -6.43536031e-02 -2.60622465e-01
   6.31812157e-02  4.76627411e-01  7.23175699e-02 -9.65033728e-02
   1.28812609e-01 -1.86323499e-01  2.10607629e-01  4.84103711e-01]
 [ 3.41208594e-01 -8.04028180e-02  3.19333957e-02 -1.32040512e-01
  -8.03725609e-03  2.93949966e-01  1.20335875e-01  8.94292214e-02
  -3.32046553e-01 -2.32058042e-02 -7.13155448e-02 -4.05900937e-02
   2.29926100e-01  5.55619278e-01 -2.38691236e-01  4.62200141e-01]
 [ 3.03523952e-01 -2.18882637e-01 -1.13341668e-01  1.64082456e-01
  -4.97131071e-02 -1.17302551e-01 -3.52054050e-01 -1.09324051e-01
  -2.68964034e-01 -7.08076041e-01 -7.86187178e-02  9.73104857e-02
  -7.79872019e-02 -1.70398895e-01  1.89266166e-01  8.81254706e-02]
 [-1.51281317e-02 -5.04520795e-01  2.63385572e-01 -4.10025518e-02
   1.36417233e-01  2.56184407e-01  1.19475241e-01 -3.18301582e-01
  -3.47978494e-01  1.54507746e-01  4.52631074e-01  1.75871401e-01
  -2.26021025e-03 -2.45215866e-01 -1.03242229e-01 -1.50944344e-01]
 [ 5.28025539e-02  8.08289447e-02 -1.11080466e-01  5.90068402e-01
   7.52679630e-01  1.69953915e-01 -8.32410910e-02  1.30414907e-01
  -7.38231442e-04  9.33254893e-02 -2.18752861e-02  2.22483836e-02
   1.43866977e-03  2.32515578e-02 -7.37148381e-03  2.30248502e-02]
 [ 6.00879315e-02  1.30638882e-01 -3.07899303e-02 -7.01817973e-01
   5.78072987e-01 -1.55784279e-01 -2.52032605e-01 -2.28876030e-01
  -1.28627493e-02 -1.91286856e-02 -4.74638968e-02 -8.49875119e-02
  -3.59040270e-02 -4.68695052e-03  1.23037575e-02  3.89106584e-03]
 [ 4.15868969e-02  5.40034891e-01  7.89224612e-02  1.24141145e-01
  -1.20738202e-01  1.69634794e-01 -6.39500353e-03 -2.97972047e-01
  -5.10628298e-01  7.39463874e-02 -4.33679677e-02 -4.46091959e-01
   1.12314335e-01 -1.41095247e-01  1.50050868e-01 -1.75172398e-01]
 [ 9.08871578e-02  5.37416161e-01  9.48197671e-02  4.47274228e-02
  -2.80446024e-02 -2.71017268e-01  2.28024041e-02 -4.03859324e-02
  -1.66004813e-01 -2.00484538e-02  3.10500259e-01  6.77195108e-01
  -2.88792936e-02  7.20472037e-02 -1.58074268e-01  3.96078344e-02]]

Eigenvalues 
[7.56864082 2.94727879 1.97581529 1.17446019 0.89518816 0.49073722
 0.32754845 0.21593634 0.15769162 0.08007019 0.05499075 0.04378233
 0.03299452 0.01556973 0.02672667 0.01964168]
In [49]:
# the "cumulative variance explained" analysis 
tot = sum(e_vals)
var_exp = [( i /tot ) * 100 for i in sorted(e_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
Cumulative Variance Explained [ 47.22409972  65.61347648  77.94146251  85.26943941  90.85491452
  93.9168412   95.96056094  97.30788334  98.29179115  98.79138449
  99.13449612  99.40767346  99.61354089  99.78030042  99.90285358
 100.        ]
In [50]:
# Plotting the variance expalained by the principal components and the cumulative variance explained.
plt.figure(figsize=(10 , 5))
plt.bar(range(1, e_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, e_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
In [51]:
eigen_pairs = [(np.abs(e_vals[i]), e_vecs[:,i]) for i in range(len(e_vals))]
eigen_pairs.sort(reverse=True)
eigen_pairs[:17]
Out[51]:
[(7.568640820733027,
  array([ 0.30808246,  0.32875546,  0.33888181,  0.30384939,  0.10532852,
          0.12298271, -0.34535252,  0.34245047,  0.31923755,  0.34120859,
          0.30352395, -0.01512813,  0.05280255,  0.06008793,  0.0415869 ,
          0.09088716])),
 (2.9472787931136826,
  array([ 0.13146702, -0.13308236,  0.06714308,  0.14709973,  0.03579149,
         -0.07609235, -0.01073887, -0.05853842, -0.12359908, -0.08040282,
         -0.21888264, -0.50452079,  0.08082894,  0.13063888,  0.54003489,
          0.53741616])),
 (1.9758152917864904,
  array([-0.11766689, -0.09240851, -0.08456744,  0.28586502,  0.64285781,
          0.57267302,  0.0893701 , -0.14661406, -0.07935358,  0.0319334 ,
         -0.11334167,  0.26338557, -0.11108047, -0.03078993,  0.07892246,
          0.09481977])),
 (1.174460189704671,
  array([ 0.03850978,  0.15334546, -0.10669704, -0.04086946,  0.07032338,
          0.04249053,  0.10923642, -0.1221287 ,  0.15546397, -0.13204051,
          0.16408246, -0.04100255,  0.5900684 , -0.70181797,  0.12414114,
          0.04472742])),
 (0.8951881590469094,
  array([ 0.04129106, -0.08001284,  0.04472161, -0.06110739, -0.0479696 ,
          0.19093002,  0.07729566, -0.00299591, -0.06492335, -0.00803726,
         -0.04971311,  0.13641723,  0.75267963,  0.57807299, -0.1207382 ,
         -0.0280446 ])),
 (0.4907372162016254,
  array([ 0.2292236 , -0.28699018, -0.0899924 ,  0.25779277,  0.1646238 ,
         -0.43402476, -0.17996016,  0.14547247, -0.44758767,  0.29394997,
         -0.11730255,  0.25618441,  0.16995392, -0.15578428,  0.16963479,
         -0.27101727])),
 (0.3275484497067706,
  array([ 0.49355892, -0.22681623,  0.12471705, -0.19832245, -0.40982025,
          0.49302743,  0.01938046,  0.10399382, -0.0643536 ,  0.12033587,
         -0.35205405,  0.11947524, -0.08324109, -0.2520326 , -0.006395  ,
          0.0228024 ])),
 (0.21593634176439339,
  array([-0.52130198, -0.19104777,  0.47722054,  0.12986151, -0.06813831,
          0.14857146, -0.25545129,  0.07854887, -0.26062247,  0.08942922,
         -0.10932405, -0.31830158,  0.13041491, -0.22887603, -0.29797205,
         -0.04038593])),
 (0.15769162127361439,
  array([ 0.4453335 ,  0.04075373,  0.07916778,  0.24710255,  0.2924105 ,
         -0.13710765,  0.17945823,  0.00632569,  0.06318122, -0.33204655,
         -0.26896403, -0.34797849, -0.00073823, -0.01286275, -0.5106283 ,
         -0.16600481])),
 (0.08007018689106539,
  array([-0.24125951,  0.12638364, -0.14143524, -0.00401464,  0.03363076,
         -0.12911555, -0.30296022,  0.16826213,  0.47662741, -0.0232058 ,
         -0.70807604,  0.15450775,  0.09332549, -0.01912869,  0.07394639,
         -0.02004845])),
 (0.05499075161848344,
  array([ 0.06336795, -0.02645077,  0.6465872 , -0.15383049,  0.12230363,
         -0.31751196,  0.10758519, -0.32218841,  0.07231757, -0.07131554,
         -0.07861872,  0.45263107, -0.02187529, -0.0474639 , -0.04336797,
          0.31050026])),
 (0.043782329929622485,
  array([-0.01686319, -0.149638  , -0.320474  , -0.00956087,  0.00963879,
         -0.09611283, -0.07522583,  0.37798141, -0.09650337, -0.04059009,
          0.09731049,  0.1758714 ,  0.02224838, -0.08498751, -0.44609196,
          0.67719511])),
 (0.032994523012908596,
  array([-0.19276126, -0.0437988 ,  0.14928889,  0.11491476, -0.01795579,
         -0.07724232,  0.76621972,  0.49169628,  0.12881261,  0.2299261 ,
         -0.0779872 , -0.00226021,  0.00143867, -0.03590403,  0.11231433,
         -0.02887929])),
 (0.0267266718203322,
  array([ 0.04591794, -0.48465718,  0.12446894, -0.50043059,  0.38773198,
          0.01180372, -0.15578004,  0.35611045,  0.21060763, -0.23869124,
          0.18926617, -0.10324223, -0.00737148,  0.01230376,  0.15005087,
         -0.15807427])),
 (0.019641684124411373,
  array([ 0.01811928, -0.53398269, -0.14015618,  0.13680569, -0.04503776,
         -0.07760028,  0.0404668 , -0.39389308,  0.48410371,  0.46220014,
          0.08812547, -0.15094434,  0.02302485,  0.00389107, -0.1751724 ,
          0.03960783])),
 (0.01556972730922174,
  array([ 0.04946636,  0.32057948, -0.06518499, -0.55594791,  0.3362185 ,
         -0.04117911,  0.0498414 , -0.07729717, -0.1863235 ,  0.55561928,
         -0.17039889, -0.24521587,  0.02325156, -0.00468695, -0.14109525,
          0.0720472 ]))]
In [ ]:
# # generating dimensionally reduced datasets
# w = np.hstack((eigen_pairs[0][3].reshape(16,4), 
#                       eigen_pairs[1][3].reshape(16,4),
#                      eigen_pairs[2][3].reshape(16,4),
#                       eigen_pairs[3][3].reshape(16,4)))
                     
# print('Matrix W:\n', w)
# X_sd_pca = X_train_sd.dot(w)
# X_test_sd_pca = X_test_sd.dot(w)

We can get 95% accuracy using 7 eigen vectors.

Spliting the data into train and test

In [75]:
from sklearn.model_selection import train_test_split

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score, confusion_matrix

# Splitting the data into train and test with same random state value.

target = adf['class_label_encoded']
features = adf.drop(['class_label_encoded'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features,target, test_size = 0.2, random_state = 10)

Scaling the data

In [76]:
# scaling the data using the standard scaler
from sklearn.preprocessing import StandardScaler
X_train_sd = StandardScaler().fit_transform(X_train)
X_test_sd = StandardScaler().fit_transform(X_test)

Tranform train and test data into eigen vectors

In [77]:
from sklearn.decomposition import PCA

pca = PCA(n_components=7)
X_train_pca = pca.fit_transform(X_train_sd)
X_test_pca = pca.transform(X_test_sd)

Train SVM with new data after PCA

In [78]:
from sklearn.svm import SVC

# Building a Support Vector Machine on train data with kernel = 'Linear'
svc_model = SVC(C= .1, kernel='linear', gamma= 1)
svc_model.fit(X_train_pca, y_train)

prediction = svc_model.predict(X_test_pca)
In [79]:
# check the accuracy on the training set
print(svc_model.score(X_train_pca, y_train))
print(svc_model.score(X_test_pca, y_test))
0.8224852071005917
0.6882352941176471
In [97]:
# Building a Support Vector Machine on train data with kernel = 'rbf'
%timeit pass
svc_model = SVC(kernel='rbf')
svc_model.fit(X_train_pca, y_train)

# check the accuracy on the training set
print("Train accuracy -  ",svc_model.score(X_train_pca, y_train))
print("Test accuracy-    ",svc_model.score(X_test_pca, y_test))
6.11 ns ± 0.0695 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
Train accuracy -   0.9511834319526628
Test accuracy-     0.9058823529411765
In [82]:
#Building a Support Vector Machine on train data(changing the kernel)
svc_model = SVC(kernel='poly')
svc_model.fit(X_train_pca, y_train)


print(svc_model.score(X_train_pca, y_train))
print(svc_model.score(X_test_pca, y_test))
0.8949704142011834
0.7941176470588235
In [83]:
svc_model = SVC(kernel='sigmoid')
svc_model.fit(X_train_pca, y_train)

prediction = svc_model.predict(X_test_pca)

print(svc_model.score(X_train_pca, y_train))
print(svc_model.score(X_test_pca, y_test))
0.5133136094674556
0.5764705882352941

Observation :- When we reduce dimension from 16 to 7 using PCA, we get train accuracy = 0.96 and test accuracy = 0.89 (with kernel = 'rbf').

Comparing the accuracy before PCA and after PCA

In [100]:
scores = {' ': ['Before PCA','After PCA'],
          ' Train accuracy ' : [0.97,0.96],
          ' Test accuracy': [0.95,0.89],
          ' Execution time': ['6.19 ns','6.11 ns']
          
        
        }

df = pd.DataFrame(scores)

print (df)
                Train accuracy    Test accuracy  Execution time
0  Before PCA              0.97            0.95         6.19 ns
1   After PCA              0.96            0.89         6.11 ns

After reducing dimention of the data from 16 variable to 7 variable we get 0.89 test accuracy with SVM . Which is 0.07 less than the test accuracy before PCA. So compromising 0.07 accuracy from droping 9 column is acceptable in terms of execution time when dealing with large data. After PCA the execution time droped from 6.19 to 6.11 ns. After PCA we save 0.07 ns of execution time.

In [ ]:
 
In [ ]: